A system for creating and manipulating generalized wordclass transition matrices from large labelled text-corpora

Authors

  • Wilfried Bloemberg
  • Michael Kesselheim
Abstract

This paper deals with the training phase of a Markov-type linguistic model that is based on transition probabilities between pairs and triplets of syntactic categories. To determine the optimal level of detail for a set of syntactic classes we developed a system that uses a set-theoretical formalism to define such sets and has some measures to compare and optimize them individually. In section two we describe the optimization problem (in terms of prediction, information and economy requirements) and our approach to its solution. Section three introduces the system that will assist a linguist in handling the prediction and economy criteria, and in the last section we present some sample results that can be achieved with it.
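As a rough illustration of the kind of quantity the abstract refers to, the sketch below estimates transition probabilities between pairs of word classes from a small labelled corpus by relative frequency, once with fine-grained tags and once with two tags merged into a coarser class. It is not the authors' system: the function name `transition_matrix`, the `class_of` mapping, the tag names and the toy corpus are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def transition_matrix(tagged_sentences, class_of=None):
    """Estimate P(next class | current class) by relative frequency.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    class_of: optional dict mapping a tag to a coarser class (hypothetical
              stand-in for a generalized word class); unmapped tags are
              kept unchanged.
    """
    class_of = class_of or {}
    pair_counts = Counter()     # counts of (current class, next class)
    context_counts = Counter()  # counts of current class with a successor
    for sentence in tagged_sentences:
        classes = [class_of.get(tag, tag) for _, tag in sentence]
        for prev, nxt in zip(classes, classes[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1
    matrix = defaultdict(dict)
    for (prev, nxt), n in pair_counts.items():
        matrix[prev][nxt] = n / context_counts[prev]
    return dict(matrix)


# Toy labelled corpus (invented for illustration only).
corpus = [
    [("the", "DET"), ("dog", "NOUN_SG"), ("barks", "VERB")],
    [("the", "DET"), ("dogs", "NOUN_PL"), ("bark", "VERB")],
]

# Fine-grained classes: singular and plural nouns are kept apart.
fine = transition_matrix(corpus)

# Coarser classes: both noun tags are merged into one class.
coarse = transition_matrix(corpus, {"NOUN_SG": "NOUN", "NOUN_PL": "NOUN"})
```

Merging the two noun tags yields a smaller matrix with fewer parameters to estimate, at the cost of no longer distinguishing number. This is one simple way to picture the trade-off between level of detail and economy that the abstract mentions.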

Publication date: 1988